Solving large nonlinear generalized eigenvalue 
problems from Density Functional Theory 
calculations in parallel 

Claus Bendtsen^ Ole H. Nielsen^ and Lars B. Hansen^ 

""UNImC, DTU Bldg. 304, DK-2800 Lyngby, Denmark 
^Dept. of Physics, DTU Bldg. 307, DK-2800 Lyngby, Denmark 

Abstract 

The quantum mechanical ground state of electrons is described by Density Func- 
tional Theory, which leads to large minimization problems. An efficient minimization 
method uses a selfconsistent field (SCF) solution of large eigenvalue problems. The 
iterative Davidson algorithm is often used, and we propose a new algorithm of this 
kind which is well suited for the SCF method, since the accuracy of the eigenso- 
lution is gradually improved along with the outer SCF-iterations. Best efficiency 
is obtained for small-block-size iterations, and the algorithm is highly memory ef- 
ficient. The implementation works well on both serial and parallel computers, and 
good scalability of the algorithm is obtained. 

1 Introduction 



Within recent years it has become possible to perform quantum mechanical 
calculations from first principles using the Density Functional Theory (DFT) 
of Hohenberg, Kohn and Sham (see e.g. [18]). Realistic calculations of the 
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electrons' ground-state can be carried out for large systems consisting of tens 
to hundreds of atoms owing to improvements in physical methods, faster com- 
puters, and more efficient algorithms. In this way modeling of a vast number 
of physical and chemical properties can be carried out, often taking large 
computer systems to their limits. 

The Density Functional Theory specifies that the ground-state total energy 
£■0 can be obtained as the global minimum of the energy functional E\p\, 

Elp] = T\p] + Ke[p] + / p{r)v,,t{r)dr + 7Ewaid (1) 

Here the position vector in space is denoted as r, p(r) (or p for short) is the 
charge density of electrons satisfying the constraint / p{r)dr = N, where N 
is the integral number of electrons in the system. The i'ext(r) is the "external" 
potential function acting on the electrons and originating from the atomic 
nuclei at fixed positions in the system. The kinetic energy T[p] as well as 
the electron-electron interaction energy Vee[p\ are functionals of the density p, 
only. Finally, an electrostatic repulsion energy between the nuclei, denoted as 
7Ewaid5 is added. The temperature is taken to be at the absolute zero so that 
entropy terms are left out in the equations below. 

Kohn and Sham[18] proved that the charge density p(r) can be represented 
by a system of N non-interacting electrons, whose complex-valued quantum- 
mechanical wavefunctions ipi{r) for i — 1, . . . , N yield the charge density as 

N 

pir)=EMr)\' (2) 

i=l 

(for simplicity, we only consider integral occupation numbers of or 1) and 
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the non-interacting kinetic energy functional Tg as 

^ r - 

Ts\p]^ -\Y. U^)'^^U^)dv (3) 

(■0i(r) denotes the complex conjugate of The energy functional Eq. 1 

can now be rewritten as 

-^[P] — + -S'Hartree [p] + E^c\p\ + j P{T'^)Ve^{T)dY + 7Ewald (4) 

where the ^^HartreeM is the classical electrostatic energy of the electrons, and 
-Sxcfp] is an (unknown) exchange- correlation energy functional. 

The wavefunctions are the i — 1, . . . ,N lowest eigensolutions of the 

Kohn-Sham equation 

Hi,,{v) = [-|V2 + ^;eff(r)]^.(r) = e,^,(r) (5) 

where H is known as the Hamiltonian operator. The effective potential Ves{r) 
is the sum of the external, the Hartree (electrostatic), and the exchange- 
correlation potentials: 



^eff(r) = ^;ext(r) + ^^^^^ + = ^ext(r) + J J^—^\dr + v,,{r) 

(6) 



It is seen from Eq. 6 that the potential Ves{r) depends on the charge density 
p(r) , which itself is given by Eq. 2 as a sum of squared wavefunctions that are 
determined as eigensolutions of Eq. 5. Hence a self- consistent solution of these 
equations must be found starting from an initial guess of p(r), from which 
the Ves{r) is constructed; solving the Kohn-Sham eigenvalue equation yields 
through Eq. 2 a renewed charge density. A minimization algorithm for E[p\ 
must be used to obtain the ground-state total energy Eq from such iterations. 
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An additional complication in some of the current approaches is the use of ul- 
trasoft pseudopotentials[27] that puts a lower requirement on the basis set size, 
but at the price of a more complicated and computationally more demanding 
Hamiltonian H. This leads to replacing the Kohn-Sham equation Eq. 5 by a 
generalized eigenvalue equation, 

H^,{v) = e,S^,{v) (7) 

where the S denotes the overlap projection operator [27]. 

In most numerical calculations the wavefunctions tl^iiv) are expanded on a suit- 
able, finite set of basis Junctions. The present work employs the widely used 
pseudopotential approximation together with plane-wave (or Fourier expan- 
sion) basis sets. In the plane-wave basis the Hamiltonian Eq. 5 is diagonally 
dominant owing to the kinetic energy term. In this method full, complex Her- 
mitian matrices of sizes 10'^ — 10^ are usually encountered, whereas the number 
of electrons N may typically be two orders of magnitude smaller, 10^ — 10^. 
However, the algorithms described below remain valid for very general classes 
of problems using other basis function sets. 

The translational symmetry of crystals is dealt with by introducing a summa- 
tion over the so-called k-points (see e.g. [10]). This leads to a set of independent 
eigenvalue problems for each k-point, which are coupled only through the ad- 
dition of charge densities in Eq. 2. Similarly, electron spin can double the num- 
ber of independent eigenvalue problems to be solved. Such almost-decoupled 
eigenvalue problems should always be solved independently in parallel, since 
this will be optimally efficient and can easily be combined with the paralleliza- 
tion described below, leading to the possibility of utilizing large numbers of 
processors efficiently. 
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The discretized problem originating from Eq. 7, which essentially has to be 
solved for the smallest eigenvalues, is the non-linear matrix eigenvalue prob- 
lem, 



i = eiS^jji, i = 1, . . . ,N 



(8) 



N 



(9) 



i=l 



where the overlap matrix S is a symmetric positive definite matrix when ultra- 
soft pseudopotentials[27] are used (and the identity matrix otherwise). This 
implies that the eigenvalues, are real and the eigenvectors, ijji mutually 
S-orthogonal. 

Given an initial charge density vector p (reasonably good initial values for p 
can usually be constructed using the densities of the constituent free atoms), 
Eq. 8 can be solved for the wave functions and the resulting charge density 
can subsequently be computed using Eq. 9. A new input charge density can 
finally be formed, and the cycle can be iterated until convergence. The overall 
procedure is commonly termed Self Consistent Field (SCF). 

Alternative approaches to the SCF method exist, notably direct minimization 
of the energy as a functional of the wavefunctions ipiir) instead of the charge 
density p, see for example [26]. It has been shown[10] that this method has 
about the same efficiency as the traditional SCF method. Another active area 
of research tries to achieve order-N scaling by, e.g., carefully selecting alterna- 
tive basis functions. However, a discussion of the numerical methods proposed 
in various order-N methods are beyond the scope of this paper but a recent 
review can be found in [9]. 
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2 Solving the Non-linear Eigenvalue Problem 

Using a quasi newton method for the iterations of Eq. 8 and Eq. 9 generally 
results in convergence within a small number of iterations. The iterations 
are usually carried out by the method Pulay[19], see e.g. [10] for a review 
of current methods. We propose a more robust and slightly more efficient 
approach, a quasi newton method as described in Appendix A. Experiments 
with the limited memory BFGS method[12] indicated that it has an almost 
similar efficiency but is less robust. 

Solving the eigenvalue problem Eq. 8 is computationally by far the most ex- 
pensive part of the algorithm because of the cost of the li{p)ipi product. The 
method of Davidson[6] has been especially successful for electronic structure 
computations, and has been improved by iterating on several eigenpairs simul- 
taneously (see e.g. [7]) and using better preconditioners[14,16,21]. Lately, bet- 
ter theoretical understanding of the methods has contributed to further gen- 
eralizations such as the Jacobi-Davidson method[24,23] and restarting tech- 
niques[3,5,25] along with other iterative methods for large eigenvalue problems 
based on Rayleigh quotient iterations [8], inverse iteration[ll] or the Lanczos 
method[15,4]. A currently advocated method in this field is the residual mini- 
mization method, RMM-DIIS[10] but as this method essentially just computes 
eigenpairs closest to the initial ones and therefore easily can result in eigenpairs 
being missed, we cannot recommend this approach. 

A major drawback — in the present context — of the above methods is that they 
all (except for RMM-DIIS) focus on solving the eigenvalue problem to a fairly 
high accuracy. However, this turns out not to be necessary during every step 
of the selfconsistency cycle; in fact, our calculations have shown that solving 
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the eigenvalue problem to a high accuracy generally does not decrease the 
number of overall steps in the quasi newton/fixpoint iteration noticeably, and 
in certain cases even increases the number of steps. 

Thus we seek an eigenproblem solver which, given an initial estimate of N 
eigenpairs, is able to improve this estimate (i.e. perform a relatively small 
number of iterations) for each of the eigenpairs. In addition, an important 
aspect of large-scale applications is that the memory requirement must be 
minimized. In order to achieve these objectives, we investigated a number of 
algorithms for improving eigenpairs so that residuals are improved only until 
a certain, adjustable limit. 

Using deflation techniques[22,17] we experienced convergence problems when 
deflating eigenpairs which were not well converged, and therefore the overall 
eigenvalue problem had to be solved to a much higher accuracy (and at a 
higher computational cost) than actually required. Similar experiences have 
been reported in [21], where the convergence tolerance was loosened as more 
and more eigenpairs converged. 

We instead propose an algorithm for solving the generalized eigenvalue prob- 
lem Eq. 8, which essentially is a blocked Davidson-like algorithm using a non- 
orthogonal basis, but where a maximum number of expansions on each eigen- 
pair can be imposed. 

The reason for choosing a non-orthogonal basis, B (and thus obtaining a gener- 
alized eigenvalue problem in the projected space, step 1 and Sb.i.C) is, besides 
from reducing the computational cost compared to using an S-orthonormal 
basis, that it reduces the number of synchronization points in the parallel 
implementation (even when compared to an orthogonalization algorithm im- 
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plemented with delayed summation as in e.g. [21]). The drawback is that 
projected eigenvalue problem, step 1 and Sb.i.C could be poorly conditioned 
and thus slow down the convergence[17, Sec. 11.10]. Our initial basis (step 1) 
is however well conditioned for every quasi newton iteration (as the estimate of 
eigenvectors ■ ■ ■ , V'jy almost S-orthogonal and S is close to the identity) 
and as only a few expansions are performed for each eigenpair (typically 2 
to 3) a poorly conditioned (projected) eigenvalue problem is unlikely to build 
up to the point where it becomes a problem. This is due to the fact that the 
eigenvectors do not have to be converged to a very high accuracy. In fact, 
introducing orthogonalization in the algorithm has not led to fewer iterations 
for a wide variety of tested physical problems. 

The algorithm is as follows: 

(1) Given an estimate . . . ,i/Jj^ for the N eigenvectors, initialize the sub- 
space basis B = [ki, . . . ,bj^] — [ijj^,... and solve the small, pro- 
jected subspace eigenvalue problem of dimension N: 

B^HB0. = AiB^SB0., i^l,... ,N, where Ai < A2 < . . . < Ajv- 

Select an iteration block-size rii, and a maximum number of basis vectors 
JT-max and a maximum number of iterations on each eigenpair k^ax- 

(2) Initialize the set /C = 0. 

(3) Loop over all eigenpairs, i = 1, . . . , N 

(a) If the i'th eigenpair Aj, 4>^ has less than k^ax iterations then compute 
the corresponding residual — HB0^ — AjSB^^. If ||rj|| is larger than 
some tolerance, add i to /C. 

(b) If /C has Ub elements or i — N: 

(i) Do an iteration on eigenpairs given by indexes in /C: 
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(A) "Precondition" the residualsQ corresponding to the uncon- 
verged eigenpairs, i.e. solve approximately for Vz G /C : 
(H - \iS)Xi = -Zi. 

(B) Extend the basis, Vz G /C : B = [B, 

(C) Solve the updated subspace eigenproblem for the N small- 
est eigenvalues, 

B^HB0. = AiB^SB0., i = 1, . . . , AT, 

where Ai < . . . < \n- 

(ii) If the number of basis vectors in B is too large (i.e. larger than 
^max — i^b) then collapse the subspace (restart): 

B = B[0^,... ,^^]. 

(iii) Set /C = 0. 

(4) Update [^^,... ,^^] =B[0^,... ,0^] 

The tolerance of step 3a is set to an order of magnitude smaller than the 
current residual norm of the quasi newton process (i.e. Hr'^lh, page 17), but 
no larger than 0.1. The tolerance will therefore decrease with the convergence 
of the quasi newton iteration. 

Since the most time consuming part of the eigenvalue solver is computing the 
products of H and S by vectors in the steps 1, 3a and 3b.i.C these vectors are 
stored and reused whenever possible. Thus the dominant memory requirement 

^ The usual physical approximation in [26] is used, where the Hamiltonian is ap- 
proximated by the diagonal (and diagonally dominant) kinetic energy term in Eq. 
5. A smooth function is multiplied onto the kinetic term in order to damp the high- 
frequency errors. [26] 
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of the algorithm is 0{3nn^ax), where n is the vector length given by the 
dimension of Eq. 8. 

Typical problem sizes might he N ^ 10^ — 10^, and as it will be shown in Sec. 
4 the parameters should typically be chosen so fcmax ~ 1 to 2, ~ 10 and 
"n-aiax ~ -/V + 5 * nb, SO the memory requirement is very close to the absolute 
minimum, which is the 3N vectors ipi, Hipi and Sipi for i — 1, ...,N. Lower 
memory consumption could be achieved by eliminating the storage of Hz/^j 
and Sipi and recomputing these products as required in steps 3a and Sb.i.C, 
however with the tradeoff of requiring about twice as many multiplications of 
H and S by vectors. 

Most floating-point operations stem from the very complicated products of H 
and S by vectors (the number of such operations should be minimized by any 
algorithm). In addition, the present algorithm contains significant amounts of 
linear algebra (approximately 8nN flops for each H product) which however 
can be carried out efficiently using BLAS level 3 operations. 

The asymptotic scaling of the total time spent on the matrix-products for each 
quasi newton iteration is Nnlog{n) due to the FFT-technique used, whereas 
the linear algebra scales as N'^n. Since the matrix size n increases roughly 
linearly with N , it is obvious that the linear algebra would dominate asymp- 
totically. For the sizes of our problems, however, the matrix products generally 
take longer than the linear algebra. 
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3 Parallel Implementation 

The products of H times vector parallelize particularly poorly, since they in- 
volve 3D FFTs of fairly small size (typically ~ 50) in each dimension. There- 
fore the parallelization for a distributed memory computer is not viable solely 
across the plane waves (vectors of length n). By using a block size equal 
to the number of available processors, it is however possible to perform Ub 
H-products in parallel using local FFTs exclusively - each on a different pro- 
cessor (the Ub may also be a multiple of the number of processors, requiring 
additional memory, however). This approach is used in the implementation for 
the initial calculation of the H and S products in step 1 and for step 1, where 
the Ub residual vectors each are gathered on different processors before precon- 
ditioning them. After the preconditioning, H and S products are formed for 
the preconditioned vector, and the results are finally redistributed across the 
processors. This requirement on the block size can be lifted on computers 
that perform small-size parallel 3D FFTs with high efficiency. 

For the remaining part of the code the parallelization is performed across the 
plane waves (i.e. each basis- vector is partitioned among the processors). This 
requires a fairly minimal amount of communication (essentially only a small 
number of reduction operations). 

The parallel implementation was done using MPI[13], and some typical results 
indicating the scaling for runs carried at on the IBM SP0 are given in Table 
10. 

^ Located at UNI'C in Lyngby, Denmark. 

^ This Platinum (Pt) problem is a an infinite slab with 35 atoms in the unit cell. 

The parameters used for the run were: vector length n = 16409 (25 Ry plane- 
wave cutoff energy), N = 210, = # of processors, nmax = N + 8nb, number of 
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Table 1 



Parallel runs of the algorithm for a Pt problem. 



# of processors 


execution time 


speed-up 


1 


9544 (10932) 


- 


2 


4208 (4671) 


2.3 


4 


1860 (2192) 


5.1 


6 


1152 (1421) 


8.3 


8 


901 (1197) 


10.6 


12 


634 (873) 


15.1 



It is seen that the speed-up of the eigenvalue solver is very satisfactory for 
these problem sizes, and larger problems are expected to scale up to a corre- 
spondingly higher number of processors. Notice that the superlinear speed-up 
is partly due to the fact that the blocksize and maximum subspace dimen- 
sion varies with the number of processors which implies that the 1 processor 
run did not benefit from BLAS level 3 operations. A sequential run with the 
same parameters as the 4 processor parallel run led to a speed-up of 3.0 on 4 
processors. 

These runs were made using only a single "k-point" (see Introduction). In 
production runs the additional trivial parallelism of a number of k-points and 
spins are utilized. The parallelization of the eigenvalue solver is duplicated 
across mutually exclusive sets of processors for each k-point and spin in an 

selfconsistent iterations = 4. The execution time is given for the eigensolver but with 
time for the complete runs including initialization (not yet fully parallelized) given 
in parenthesis. 
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optimally efficient way. Our work uses MPI communicators to implement this 
in a simple way. However, memory consumption is not reduced by exploiting 
this extra parallel dimension. 



4 Numerical Examples 

In order to investigate the stability of the proposed algorithm and its sensi- 
tivity to choice of parameters, a large number of different examples have been 
tested. Some representative examplesj^ are shown in Fig. 1, 2 and 3. 



other operations 
Residual comp. 
Restarting 

Solving projected eigensystem 
H and S products 





Fig. 1. Time spent in different parts of the code for an Au problem, using 6 processors 
{rib = 6), Ajjnax = 1 and varying the maximum number of basis vectors, nmax- 

The experiments indicate that the size of n?, is not critical as long as it is 
chosen large enough for the solution of the projected eigensystem not to be 



^ The Gold (Au) problem shown in the figures is an infinite slab with 18 atoms in 
the unit cell. The parameters used for the run were: vector length n = 13665 (30 Ry 
plane- wave cutoff energy), = 198, number of selfconsistent iterations = 15. 
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unnecessarily large. The maximum subspace dimension, rij^ax is however im- 
portant - if nmax is too small the overhead in restarting becomes dominant, on 
the other hand if n^^x becomes unnecessarily large the memory consumption 
and the cost of solving the projected eigenvalue problem becomes large. 




From Fig. 3 it is seen that the overall work grows linearly when increasing 
the precision to which we solve the eigenvalue problem, indicating that it 
is not worthwhile to solve the eigenvalue problem to a high accuracy. This is 
furthermore supported by Fig. 4 which shows that the convergence of the quasi 
newton process is largely unaffected by increasing the number of iterations on 
each eigenpair, /Cmax- 
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2500 



Other operations 
Residual comp. 
Restarting 

Solving projected eigensystem 
H and S products 




Fig. 3. Time spent in different parts of the code for an Au problem, using 6 processors 
{rih = 6), nmax = 180 and varying the maximum number of expansions on each 
eigenpair, k^a.^- 




Iteration number 



Fig. 4. Convergence of the quasi newton process for an Au problem for varying 

maximum number of expansions on each eigenpair, femax- 
5 Conclusion 



The algorithm proposed in this paper presents an efficient way to solve the 

nonlinear generalized eigenvalue problems occuring in DFT calculations on 
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multiple processors. Part of the algorithm is a variant of the Davidson algo- 
rithm [7] which focuses on using only small amounts of work for each eigenpair 
in each of the quasi newton iterations, and thereby being able to use a non- 
orthogonal basis. 

In contrast, our best effort using traditional deflation techniques typically 
required 3 to 5 times the amount of overall work, as each of the eigenpairs 
had to be converged to a much higher precision in order for the deflation to 
be robust. Attempts were made to improve on this and an algorithm using 
deflation which was almost as efficient as the algorithm proposed on page 8 
was devised. However, it involved complicated heuristics especially for treating 
eigenvalues with higher multiplicity (which often occur due to symmetries in 
the systems) and this additional effort did not seem worthwhile. 

In summary, the algorithm proposed in the present paper has been designed 
for nonlinear eigenvalue problems such as those occurring in selfconsistent 
DFT calculations. The algorithm has shown robustness and efficiency, is flex- 
ible in accomodating memory limitations, and allows constraints in the num- 
ber of iterations and the tolerance of eigenpairs. Several parameters can be 
used to optimize the convergence, and we propose recommended values for 
DFT iterations. Good parallel performance is achieved on distributed mem- 
ory computers, but since the eigenvalue problem in the projected subspace is 
not parallelized, the parallel scalability may be limited, independent of the 
efficiency of the interconnecting network. 
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A The Projected Newton Method 



Given a series of input charge densities, p'^'"^, ■ ■ ■ , p'^^ {k > m > 1) for Eq. 8, 
the corresponding output charge densities defined by Eq. 9 will be denoted as 
. . . The residual of the charge density iteration is then defined by 

!—out !—m 

of which the Jacobian can be approximated by TfcG|., where denotes a 
pseudo inverse of G, and we have defined 



k-l 
in ^in 



G, = {Ap'-^+\ . . . , V), V = Pf„ - 

Tfe = (Ar'=-"^+\ . . . , Ar''), Ar'^ = r'^ - r'^'K 

Applying Newton's method gives 

p^;^'^pl-G,TlrJ (A.l) 

where = T|.r*^ is obtained as the least squares solution of 

TkOk = r\ (A.2) 



It is seen that the fixpoint solution p* = p^ corresponding to r:^ = will 
never be found unless p* — (where i is equal to the initial k used) belongs 
to the span of Gj (i.e., the span of G^ remains constant). Therefore, a term 
i^ij^ — Tfcftfc) is added to the right hand side of Eq. A.l for some arbitrarily 
chosen full rank, diagonal matrix /? as proposed in [1] (notice that = T^kQ^k 
if Tfc has full rank). 

The least squares system Eq. A.2 is solved via the thin QR factorization on 
the augmented matrix [T^a^] as described in e.g. [2], and the factorization is 
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updated rather than recomputed in the following iteration using the routines 
from [20]. 

As the above algorithm can not be used for the first iteration, the value = 

o ' i_m 

+ Pl^ is used instead. For the subsequent iterations the above procedure is 
applied with m incremented by one for each iteration (starting with m — 1) 
until a maximum value mmax is reached. Choosing mmax in the range of 5 to 
10 generally works well for the present DFT problems. 
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